Customer Segmentation and LTV Prediction

In this section, we segment customers using RFM (Recency, Frequency and Monetary Value) scores. Since we have data for the whole of 2017, we split it into two sets: January-September and October-December. We then segment the customers based on the first 9 months and predict their customer lifetime value (LTV) for the last 3 months.

Customer Segmentation (First 9 Months)
We calculate recency (the number of days since the last transaction), frequency (the number of transactions within the 9 months) and log revenue for each customer. We then cluster each of these 3 features, giving 3 cluster labels per customer, and re-order the labels so that a larger cluster number represents a better result. The overall score is the sum of these 3 scores, and we segment the customers into 4 groups according to it.
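
As a compact sketch of this scoring scheme on hypothetical toy data (the notebook below does the same thing step by step, feature by feature):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# 50 "good" customers (recent, frequent, paying) and 50 "bad" ones
toy = pd.DataFrame({
    'Recency':    np.r_[rng.integers(0, 30, 50),  rng.integers(200, 270, 50)],
    'Frequency':  np.r_[rng.integers(5, 10, 50),  np.ones(50, dtype=int)],
    'logRevenue': np.r_[rng.uniform(15, 20, 50),  np.zeros(50)],
})

def score(series, ascending):
    """Cluster one feature, then relabel so a larger id means a better customer."""
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(series.to_frame())
    order = series.groupby(labels).mean().sort_values(ascending=ascending).index
    remap = {old: new for new, old in enumerate(order)}
    return pd.Series(labels, index=series.index).map(remap)

# low recency is better (descending), high frequency / revenue are better (ascending)
toy['OverallScore'] = (score(toy['Recency'], False)
                       + score(toy['Frequency'], True)
                       + score(toy['logRevenue'], True))
print(toy.groupby('OverallScore')['logRevenue'].mean())
```

The `score` helper is an illustration of the idea only; the notebook implements the relabeling with its own `order_cluster` function further down.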

LTV Prediction (Last 3 Months)
First, we cluster the log revenue of the last 3 months to define the LTV classes. Then we use the 9-month features to predict the 3-month LTV class.

In [1]:
from __future__ import division
from datetime import datetime, timedelta,date
import pandas as pd
%matplotlib inline
from sklearn.metrics import classification_report,confusion_matrix
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.cluster import KMeans

import chart_studio.plotly as py
import plotly.offline as pyoff
import plotly.graph_objs as go

import xgboost as xgb
from sklearn.model_selection import KFold, cross_val_score, train_test_split

import warnings
warnings.filterwarnings("ignore")
In [2]:
pyoff.init_notebook_mode()
In [3]:
# load data
df = pd.read_pickle('2017_clean.pkl')
train = df.copy()
In [4]:
# keep relevant columns and change data types
train = train[['fullVisitorId', 'date', 'totals.totalTransactionRevenue']]
train["date"] = pd.to_datetime(train["date"], format="%Y%m%d") # setting the column as pandas datetime
train['totals.totalTransactionRevenue']=train['totals.totalTransactionRevenue'].astype('float')
In [5]:
train.head()
Out[5]:
fullVisitorId date totals.totalTransactionRevenue
0 3162355547410993243 2017-10-16 NaN
1 8934116514970143966 2017-10-16 NaN
2 7992466427990357681 2017-10-16 NaN
3 9075655783635761930 2017-10-16 NaN
4 6960673291025684308 2017-10-16 NaN
In [6]:
train['date'].describe()
Out[6]:
count                  928860
unique                    365
top       2017-12-12 00:00:00
freq                     9234
first     2017-01-01 00:00:00
last      2017-12-31 00:00:00
Name: date, dtype: object
In [7]:
# select data from Jan to Sept
train_9 = train[(train.date < date(2017,10,1)) & (train.date >= date(2017,1,1))].reset_index(drop=True)
# select data from Oct to Dec
train_12 = train[(train.date >= date(2017,10,1)) & (train.date <= date(2017,12,31))].reset_index(drop=True)
In [8]:
train_9['date'].describe()
Out[8]:
count                  636781
unique                    273
top       2017-09-20 00:00:00
freq                     4880
first     2017-01-01 00:00:00
last      2017-09-30 00:00:00
Name: date, dtype: object
In [9]:
train_12['date'].describe()
Out[9]:
count                  292079
unique                     92
top       2017-12-12 00:00:00
freq                     9234
first     2017-10-01 00:00:00
last      2017-12-31 00:00:00
Name: date, dtype: object
In [10]:
customer = pd.DataFrame(train['fullVisitorId'].unique())
customer.columns = ['fullVisitorId']

Recency

In [11]:
max_purchase = train_9.groupby('fullVisitorId').date.max().reset_index()
In [12]:
max_purchase.columns = ['fullVisitorId','MaxPurchaseDate']
In [13]:
max_purchase['Recency'] = (max_purchase['MaxPurchaseDate'].max() - max_purchase['MaxPurchaseDate']).dt.days
In [14]:
user_recency = pd.merge(customer, max_purchase[['fullVisitorId','Recency']], on='fullVisitorId')
In [15]:
user_recency.head()
Out[15]:
fullVisitorId Recency
0 8934116514970143966 4
1 6135613929977117121 1
2 9630953897602496525 19
3 8461088577726398196 14
4 8574773000077178719 136
In [16]:
user_recency.Recency.describe()
Out[16]:
count    495989.000000
mean        127.851523
std          80.999227
min           0.000000
25%          54.000000
50%         126.000000
75%         198.000000
max         272.000000
Name: Recency, dtype: float64
In [17]:
plot_data = [
    go.Histogram(
        x=user_recency['Recency']
    )
]

plot_layout = go.Layout(
        title='Recency'
    )
fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)
In [18]:
# Elbow Method
sse = {}
recency = user_recency[['Recency']]
for k in range(1, 10):
    kmeans = KMeans(n_clusters=k, max_iter=1000).fit(recency)
    sse[k] = kmeans.inertia_
plt.figure()
plt.plot(list(sse.keys()), list(sse.values()))
plt.xlabel("Number of clusters")
plt.show()
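
Reading the elbow off the plot is a judgment call; a simple programmatic stand-in (a heuristic of our own, not something this notebook relies on) is to keep increasing k while each extra cluster still removes a large share of the remaining SSE:

```python
import numpy as np
from sklearn.cluster import KMeans

# hypothetical 1-D data with 4 well-separated groups
X = np.concatenate([np.random.RandomState(s).normal(loc=m, size=(50, 1))
                    for s, m in enumerate([0, 100, 200, 300])])
sse = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in range(1, 8)}
# relative SSE drop from adding one more cluster
drops = {k: 1 - sse[k] / sse[k - 1] for k in range(2, 8)}
# last k whose extra cluster removed at least 30% of the remaining SSE
elbow = max(k for k, d in drops.items() if d > 0.3)
print(elbow)
```

The 30% threshold is arbitrary; on this toy data the heuristic lands on the true number of groups, 4.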
In [19]:
# let's try 4 first
kmeans = KMeans(n_clusters=4)
kmeans.fit(user_recency[['Recency']])
user_recency['RecencyCluster'] = kmeans.predict(user_recency[['Recency']])
In [20]:
user_recency.groupby('RecencyCluster')['Recency'].describe()
Out[20]:
count mean std min 25% 50% 75% max
RecencyCluster
0 117733.0 96.196818 20.020799 63.0 79.0 95.0 114.0 132.0
1 116766.0 236.708854 19.870417 203.0 220.0 236.0 253.0 272.0
2 140699.0 29.319469 18.341870 0.0 12.0 29.0 45.0 62.0
3 120791.0 168.246335 20.092493 133.0 151.0 169.0 185.0 202.0
In [21]:
def order_cluster(cluster_field_name, target_field_name,df,ascending):
    df_new = df.groupby(cluster_field_name)[target_field_name].mean().reset_index()
    df_new = df_new.sort_values(by=target_field_name,ascending=ascending).reset_index(drop=True)
    df_new['index'] = df_new.index
    df_final = pd.merge(df,df_new[[cluster_field_name,'index']], on=cluster_field_name)
    df_final = df_final.drop([cluster_field_name],axis=1)
    df_final = df_final.rename(columns={"index":cluster_field_name})
    return df_final
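
A quick sanity check of `order_cluster` on a hypothetical toy frame (the function is redefined here so the snippet is self-contained): after the call, cluster ids increase with the cluster mean of the target field.

```python
import pandas as pd

def order_cluster(cluster_field_name, target_field_name, df, ascending):
    # same logic as the notebook's order_cluster
    df_new = df.groupby(cluster_field_name)[target_field_name].mean().reset_index()
    df_new = df_new.sort_values(by=target_field_name, ascending=ascending).reset_index(drop=True)
    df_new['index'] = df_new.index
    df_final = pd.merge(df, df_new[[cluster_field_name, 'index']], on=cluster_field_name)
    df_final = df_final.drop([cluster_field_name], axis=1)
    df_final = df_final.rename(columns={"index": cluster_field_name})
    return df_final

toy = pd.DataFrame({'cl': [0, 0, 1, 1, 2, 2], 'val': [50, 60, 5, 6, 100, 110]})
out = order_cluster('cl', 'val', toy, True)
print(out.groupby('cl')['val'].mean())  # cluster 0 -> 5.5, 1 -> 55.0, 2 -> 105.0
```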
In [22]:
# call the function above to order the clusters
user_recency = order_cluster('RecencyCluster', 'Recency', user_recency,False)
In [23]:
user_recency.groupby('RecencyCluster')['Recency'].describe()
Out[23]:
count mean std min 25% 50% 75% max
RecencyCluster
0 116766.0 236.708854 19.870417 203.0 220.0 236.0 253.0 272.0
1 120791.0 168.246335 20.092493 133.0 151.0 169.0 185.0 202.0
2 117733.0 96.196818 20.020799 63.0 79.0 95.0 114.0 132.0
3 140699.0 29.319469 18.341870 0.0 12.0 29.0 45.0 62.0
In [24]:
user_recency.groupby('RecencyCluster')['Recency'].agg(['count','mean']).reset_index()
Out[24]:
RecencyCluster count mean
0 0 116766 236.708854
1 1 120791 168.246335
2 2 117733 96.196818
3 3 140699 29.319469
In [25]:
user_recency.groupby('RecencyCluster')['Recency'].hist()
plt.xlabel('Recency')
plt.ylabel('Count')
plt.title('Recency_Cluster')
Out[25]:
Text(0.5, 1.0, 'Recency_Cluster')

Frequency

In [26]:
user_frequency = train_9.groupby('fullVisitorId').date.count().reset_index()
In [27]:
user_frequency.columns = ['fullVisitorId','Frequency']
In [28]:
user_frequency.head()
Out[28]:
fullVisitorId Frequency
0 0000027376579751715 1
1 0000039460501403861 1
2 0000040862739425590 2
3 0000049363351866189 3
4 0000062267706107999 1
In [29]:
user = pd.merge(user_recency, user_frequency, on='fullVisitorId')
In [30]:
user.head()
Out[30]:
fullVisitorId Recency RecencyCluster Frequency
0 8934116514970143966 4 3 3
1 6135613929977117121 1 3 6
2 9630953897602496525 19 3 1
3 8461088577726398196 14 3 3
4 7122741899604173060 2 3 2
In [31]:
user.Frequency.describe()
Out[31]:
count    495989.000000
mean          1.283861
std           1.337847
min           1.000000
25%           1.000000
50%           1.000000
75%           1.000000
max         195.000000
Name: Frequency, dtype: float64
In [32]:
plot_data = [
    go.Histogram(
        x=user['Frequency']
    )
]

plot_layout = go.Layout(
        title='Frequency'
    )
fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)
In [33]:
# Elbow Method
sse = {}
frequency = user[['Frequency']]
for k in range(1, 15):
    kmeans = KMeans(n_clusters=k, max_iter=1000).fit(frequency)
    sse[k] = kmeans.inertia_
plt.figure()
plt.plot(list(sse.keys()), list(sse.values()))
plt.xlabel("Number of clusters")
plt.show()
In [34]:
kmeans = KMeans(n_clusters=4)
kmeans.fit(user[['Frequency']])
user['FrequencyCluster'] = kmeans.predict(user[['Frequency']])
In [35]:
user = order_cluster('FrequencyCluster', 'Frequency',user,True)
In [36]:
user.groupby('FrequencyCluster')['Frequency'].describe()
Out[36]:
count mean std min 25% 50% 75% max
FrequencyCluster
0 470010.0 1.097385 0.296482 1.0 1.00 1.0 1.00 2.0
1 24609.0 3.980007 1.404383 3.0 3.00 3.0 4.00 9.0
2 1336.0 14.991018 6.341935 10.0 11.00 13.0 17.00 51.0
3 34.0 89.029412 33.107250 54.0 64.25 81.5 101.25 195.0
In [37]:
user.groupby('FrequencyCluster')['Frequency'].agg(['count','mean']).reset_index()
Out[37]:
FrequencyCluster count mean
0 0 470010 1.097385
1 1 24609 3.980007
2 2 1336 14.991018
3 3 34 89.029412

Monetary Value

In [38]:
revenue = train_9.groupby('fullVisitorId')['totals.totalTransactionRevenue'].sum().reset_index()
In [39]:
revenue.columns = ['fullVisitorId', 'Revenue']
revenue['logRevenue'] = np.log(1+revenue['Revenue'])
revenue = revenue.drop(['Revenue'], axis=1)
revenue.head()
Out[39]:
fullVisitorId logRevenue
0 0000027376579751715 0.0
1 0000039460501403861 0.0
2 0000040862739425590 0.0
3 0000049363351866189 0.0
4 0000062267706107999 0.0
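
A note on the transform: `np.log(1 + x)` is mathematically identical to `np.log1p(x)`; the +1 keeps zero-revenue visitors at exactly 0, and `log1p` is the more numerically stable form for values near zero.

```python
import numpy as np

rev = np.array([0.0, 1e-12, 1e6, 5e8])   # hypothetical raw revenue values
print(np.log(1 + rev))
print(np.log1p(rev))                      # same result, stabler for tiny x
```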
In [40]:
user = pd.merge(user, revenue, on='fullVisitorId')
user.head()
Out[40]:
fullVisitorId Recency RecencyCluster Frequency FrequencyCluster logRevenue
0 8934116514970143966 4 3 3 1 16.612182
1 6135613929977117121 1 3 6 1 0.000000
2 8461088577726398196 14 3 3 1 0.000000
3 9056866253625889952 2 3 8 1 0.000000
4 1858833000939214139 1 3 4 1 0.000000
In [41]:
user.logRevenue.describe()
Out[41]:
count    495989.000000
mean          0.264691
std           2.167142
min           0.000000
25%           0.000000
50%           0.000000
75%           0.000000
max          25.966048
Name: logRevenue, dtype: float64
In [42]:
# the elbow method shows 2 clusters
kmeans = KMeans(n_clusters=2)
kmeans.fit(user[['logRevenue']])
user['logRevenueCluster'] = kmeans.predict(user[['logRevenue']])
In [43]:
user = order_cluster('logRevenueCluster', 'logRevenue',user,True)
In [44]:
user.groupby('logRevenueCluster')['logRevenue'].describe()
Out[44]:
count mean std min 25% 50% 75% max
logRevenueCluster
0 488670.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
1 7319.0 17.937419 1.125177 14.731802 17.147358 17.746355 18.515718 25.966048
In [45]:
user.groupby('logRevenueCluster')['logRevenue'].agg(['count','mean']).reset_index()
Out[45]:
logRevenueCluster count mean
0 0 488670 0.000000
1 1 7319 17.937419

Overall Segmentation

In [46]:
user.head()
Out[46]:
fullVisitorId Recency RecencyCluster Frequency FrequencyCluster logRevenue logRevenueCluster
0 8934116514970143966 4 3 3 1 16.612182 1
1 0012276352424581690 16 3 5 1 17.804124 1
2 9179364299608933418 8 3 3 1 19.412153 1
3 3700714855829972615 5 3 8 1 19.586889 1
4 5334896187622326438 6 3 4 1 17.281246 1
In [47]:
user['OverallScore'] = user['RecencyCluster'] + user['FrequencyCluster'] + user['logRevenueCluster']
In [48]:
user.groupby('OverallScore')[['Recency','Frequency','logRevenue']].mean().reset_index()
Out[48]:
OverallScore Recency Frequency logRevenue
0 0 236.827423 1.082989 0.000000
1 1 171.096818 1.185677 0.121840
2 2 100.516996 1.248590 0.240814
3 3 33.505855 1.253311 0.227977
4 4 35.643388 4.056357 3.662390
5 5 33.389222 8.849634 12.900719
6 6 25.857143 23.529101 16.950539
7 7 14.000000 130.200000 19.778151

It looks like customers with an overall score of 7 are the highest-value group.

In [49]:
user['Segment'] = 'Low-Value'
user.loc[user['OverallScore']>0,'Segment'] = 'Mid1-Value' 
user.loc[user['OverallScore']>3,'Segment'] = 'Mid2-Value' 
user.loc[user['OverallScore']>5,'Segment'] = 'High-Value' 
In [50]:
sg = user.groupby('Segment')['logRevenue'].agg(['count','mean']).sort_values(by=['mean']).reset_index()
sg.columns = ['Segment','count','mean_logRevenue_9']
sg
Out[50]:
Segment count mean_logRevenue_9
0 Low-Value 111388 0.000000
1 Mid1-Value 373482 0.198362
2 Mid2-Value 10925 4.933348
3 High-Value 194 17.023416
In [51]:
user_sample = user.sample(20000)
In [52]:
# tx_graph = user_sample

# plot_data = [
#     go.Scatter(
#         x=tx_graph.query("Segment == 'Low-Value'")['Recency'],
#         y=tx_graph.query("Segment == 'Low-Value'")['Frequency'],
#         mode='markers',
#         name='Low',
#         marker= dict(size= 7,
#             line= dict(width=1),
#             color= 'blue',
#             opacity= 0.8
#            )
#     ),
#         go.Scatter(
#         x=tx_graph.query("Segment == 'Mid1-Value'")['Recency'],
#         y=tx_graph.query("Segment == 'Mid1-Value'")['Frequency'],
#         mode='markers',
#         name='Mid1',
#         marker= dict(size= 9,
#             line= dict(width=1),
#             color= 'green',
#             opacity= 0.5
#            )
#     ),
#             go.Scatter(
#         x=tx_graph.query("Segment == 'Mid2-Value'")['Recency'],
#         y=tx_graph.query("Segment == 'Mid2-Value'")['Frequency'],
#         mode='markers',
#         name='Mid2',
#         marker= dict(size= 9,
#             line= dict(width=1),
#             color= 'orange',
#             opacity= 0.5
#            )
#     ),
#         go.Scatter(
#         x=tx_graph.query("Segment == 'High-Value'")['Recency'],
#         y=tx_graph.query("Segment == 'High-Value'")['Frequency'],
#         mode='markers',
#         name='High',
#         marker= dict(size= 11,
#             line= dict(width=1),
#             color= 'red',
#             opacity= 0.9
#            )
#     ),
# ]

# plot_layout = go.Layout(
#         yaxis= {'title': "Frequency"},
#         xaxis= {'title': "Recency"},
#         title='Segments'
#     )
# fig = go.Figure(data=plot_data, layout=plot_layout)
# pyoff.iplot(fig)
In [53]:
# tx_graph = user_sample

# plot_data = [
#     go.Scatter(
#         x=tx_graph.query("Segment == 'Low-Value'")['Recency'],
#         y=tx_graph.query("Segment == 'Low-Value'")['logRevenue'],
#         mode='markers',
#         name='Low',
#         marker= dict(size= 7,
#             line= dict(width=1),
#             color= 'blue',
#             opacity= 0.8
#            )
#     ),
#         go.Scatter(
#         x=tx_graph.query("Segment == 'Mid1-Value'")['Recency'],
#         y=tx_graph.query("Segment == 'Mid1-Value'")['logRevenue'],
#         mode='markers',
#         name='Mid1',
#         marker= dict(size= 9,
#             line= dict(width=1),
#             color= 'green',
#             opacity= 0.5
#            )
#     ),
#             go.Scatter(
#         x=tx_graph.query("Segment == 'Mid2-Value'")['Recency'],
#         y=tx_graph.query("Segment == 'Mid2-Value'")['logRevenue'],
#         mode='markers',
#         name='Mid2',
#         marker= dict(size= 9,
#             line= dict(width=1),
#             color= 'orange',
#             opacity= 0.5
#            )
#     ),
#         go.Scatter(
#         x=tx_graph.query("Segment == 'High-Value'")['Recency'],
#         y=tx_graph.query("Segment == 'High-Value'")['logRevenue'],
#         mode='markers',
#         name='High',
#         marker= dict(size= 11,
#             line= dict(width=1),
#             color= 'red',
#             opacity= 0.9
#            )
#     ),
# ]

# plot_layout = go.Layout(
#         yaxis= {'title': "logRevenue"},
#         xaxis= {'title': "Recency"},
#         title='Segments'
#     )
# fig = go.Figure(data=plot_data, layout=plot_layout)
# pyoff.iplot(fig)
In [54]:
# tx_graph = user_sample

# plot_data = [
#     go.Scatter(
#         x=tx_graph.query("Segment == 'Low-Value'")['Frequency'],
#         y=tx_graph.query("Segment == 'Low-Value'")['logRevenue'],
#         mode='markers',
#         name='Low',
#         marker= dict(size= 7,
#             line= dict(width=1),
#             color= 'blue',
#             opacity= 0.8
#            )
#     ),
#         go.Scatter(
#         x=tx_graph.query("Segment == 'Mid1-Value'")['Frequency'],
#         y=tx_graph.query("Segment == 'Mid1-Value'")['logRevenue'],
#         mode='markers',
#         name='Mid1',
#         marker= dict(size= 9,
#             line= dict(width=1),
#             color= 'green',
#             opacity= 0.5
#            )
#     ),
#             go.Scatter(
#         x=tx_graph.query("Segment == 'Mid2-Value'")['Frequency'],
#         y=tx_graph.query("Segment == 'Mid2-Value'")['logRevenue'],
#         mode='markers',
#         name='Mid2',
#         marker= dict(size= 9,
#             line= dict(width=1),
#             color= 'orange',
#             opacity= 0.5
#            )
#     ),
#         go.Scatter(
#         x=tx_graph.query("Segment == 'High-Value'")['Frequency'],
#         y=tx_graph.query("Segment == 'High-Value'")['logRevenue'],
#         mode='markers',
#         name='High',
#         marker= dict(size= 11,
#             line= dict(width=1),
#             color= 'red',
#             opacity= 0.9
#            )
#     ),
# ]

# plot_layout = go.Layout(
#         yaxis= {'title': "logRevenue"},
#         xaxis= {'title': "Frequency"},
#         title='Segments'
#     )
# fig = go.Figure(data=plot_data, layout=plot_layout)
# pyoff.iplot(fig)

Lifetime Value Prediction

In [55]:
user.head()
Out[55]:
fullVisitorId Recency RecencyCluster Frequency FrequencyCluster logRevenue logRevenueCluster OverallScore Segment
0 8934116514970143966 4 3 3 1 16.612182 1 5 Mid2-Value
1 0012276352424581690 16 3 5 1 17.804124 1 5 Mid2-Value
2 9179364299608933418 8 3 3 1 19.412153 1 5 Mid2-Value
3 3700714855829972615 5 3 8 1 19.586889 1 5 Mid2-Value
4 5334896187622326438 6 3 4 1 17.281246 1 5 Mid2-Value
In [56]:
train_12.head()
Out[56]:
fullVisitorId date totals.totalTransactionRevenue
0 3162355547410993243 2017-10-16 NaN
1 8934116514970143966 2017-10-16 NaN
2 7992466427990357681 2017-10-16 NaN
3 9075655783635761930 2017-10-16 NaN
4 6960673291025684308 2017-10-16 NaN
In [57]:
revenue_12 = train_12.groupby('fullVisitorId')['totals.totalTransactionRevenue'].sum().reset_index()
revenue_12.columns = ['fullVisitorId','Revenue_12']
revenue_12['logRevenue_12'] = np.log(1+revenue_12['Revenue_12'])
revenue_12 = revenue_12.drop(['Revenue_12'], axis=1)
In [58]:
revenue_12.head()
Out[58]:
fullVisitorId logRevenue_12
0 0000000259678714014 0.0
1 0000117255350596610 0.0
2 0000118334805178127 0.0
3 000020731284570628 0.0
4 0000232022622082281 0.0
In [59]:
# merge 9-month features with 3-month revenue
merge = pd.merge(user, revenue_12, on='fullVisitorId', how='left')
In [60]:
merge.head()
Out[60]:
fullVisitorId Recency RecencyCluster Frequency FrequencyCluster logRevenue logRevenueCluster OverallScore Segment logRevenue_12
0 8934116514970143966 4 3 3 1 16.612182 1 5 Mid2-Value 16.950570
1 0012276352424581690 16 3 5 1 17.804124 1 5 Mid2-Value 21.522470
2 9179364299608933418 8 3 3 1 19.412153 1 5 Mid2-Value 0.000000
3 3700714855829972615 5 3 8 1 19.586889 1 5 Mid2-Value 17.527862
4 5334896187622326438 6 3 4 1 17.281246 1 5 Mid2-Value 0.000000
In [61]:
merge = merge.fillna(0)
In [62]:
sg_12 = merge.groupby('Segment')['logRevenue_12'].agg(['count','mean']).sort_values(by=['mean']).reset_index()
sg_12.columns = ['Segment','count','mean_logRevenue_12']
sg_12
Out[62]:
Segment count mean_logRevenue_12
0 Low-Value 111388 0.000311
1 Mid1-Value 373482 0.005806
2 Mid2-Value 10925 0.191444
3 High-Value 194 1.820208
In [63]:
merge.shape
Out[63]:
(495989, 10)
In [64]:
kmeans = KMeans(n_clusters=4)
kmeans.fit(merge[['logRevenue_12']])
merge['LTVCluster'] = kmeans.predict(merge[['logRevenue_12']])
In [65]:
merge = order_cluster('LTVCluster', 'logRevenue_12',merge,True)
In [66]:
merge.groupby('LTVCluster')['logRevenue_12'].describe()
Out[66]:
count mean std min 25% 50% 75% max
LTVCluster
0 495733.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
1 112.0 17.055235 0.548509 14.731802 16.713340 17.148251 17.510059 17.719300
2 98.0 18.394497 0.398253 17.779746 18.025636 18.310745 18.755153 19.266077
3 46.0 20.319989 0.595112 19.396183 19.912267 20.279084 20.712651 21.578307
In [67]:
ltv = merge.groupby('LTVCluster')['logRevenue_12'].agg(['count','mean']).reset_index()
ltv.columns = ['LTVCluster','count','mean_logRevenue_12']
ltv
Out[67]:
LTVCluster count mean_logRevenue_12
0 0 495733 0.000000
1 1 112 17.055235
2 2 98 18.394497
3 3 46 20.319989
In [68]:
cluster = merge.copy()
In [69]:
cluster.head()
Out[69]:
fullVisitorId Recency RecencyCluster Frequency FrequencyCluster logRevenue logRevenueCluster OverallScore Segment logRevenue_12 LTVCluster
0 8934116514970143966 4 3 3 1 16.612182 1 5 Mid2-Value 16.950570 1
1 3700714855829972615 5 3 8 1 19.586889 1 5 Mid2-Value 17.527862 1
2 0267692164162304307 19 3 3 1 16.165933 1 5 Mid2-Value 17.599018 1
3 4921798700570421930 59 3 5 1 17.087358 1 5 Mid2-Value 17.001863 1
4 1956114514458489728 0 3 6 1 17.656467 1 5 Mid2-Value 16.498585 1
In [70]:
# create dummy variables for Segment
cluster['Segment_High']=0
cluster.loc[cluster['Segment']=='High-Value','Segment_High']=1
cluster['Segment_Mid1']=0
cluster.loc[cluster['Segment']=='Mid1-Value','Segment_Mid1']=1
cluster['Segment_Mid2']=0
cluster.loc[cluster['Segment']=='Mid2-Value','Segment_Mid2']=1
cluster['Segment_Low']=0
cluster.loc[cluster['Segment']=='Low-Value','Segment_Low']=1
In [71]:
cluster = cluster.drop(['Segment'],axis=1)
cluster.head()
Out[71]:
fullVisitorId Recency RecencyCluster Frequency FrequencyCluster logRevenue logRevenueCluster OverallScore logRevenue_12 LTVCluster Segment_High Segment_Mid1 Segment_Mid2 Segment_Low
0 8934116514970143966 4 3 3 1 16.612182 1 5 16.950570 1 0 0 1 0
1 3700714855829972615 5 3 8 1 19.586889 1 5 17.527862 1 0 0 1 0
2 0267692164162304307 19 3 3 1 16.165933 1 5 17.599018 1 0 0 1 0
3 4921798700570421930 59 3 5 1 17.087358 1 5 17.001863 1 0 0 1 0
4 1956114514458489728 0 3 6 1 17.656467 1 5 16.498585 1 0 0 1 0
In [72]:
# correlation with the 3-month (Oct-Dec) LTVCluster
corr_matrix = cluster.corr()
corr_matrix['LTVCluster'].sort_values(ascending=False)
Out[72]:
LTVCluster           1.000000
logRevenue_12        0.942943
Segment_High         0.089669
logRevenue           0.076189
logRevenueCluster    0.071305
Frequency            0.065218
Segment_Mid2         0.064250
FrequencyCluster     0.048360
OverallScore         0.038971
RecencyCluster       0.022761
Segment_Low         -0.011026
Segment_Mid1        -0.015308
Recency             -0.025665
Name: LTVCluster, dtype: float64

Building the Model (LightGBM)

In [135]:
X = cluster.drop(['LTVCluster','logRevenue_12','fullVisitorId'],axis=1)
y = cluster['LTVCluster']
In [136]:
X_columns = cluster.drop(['LTVCluster','logRevenue_12','fullVisitorId'],axis=1).columns
y_column = ['LTVCluster']
In [75]:
import lightgbm as lgb
from imblearn.over_sampling import SMOTE
In [139]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=56)
In [140]:
# use SMOTE for the imbalanced data
X_train, y_train= SMOTE().fit_resample(X_train, y_train)
In [141]:
# fitting LightGBM
params = {"objective" : "multiclass",
          "num_class": 4,
          "metric" : "multi_error",
          "num_leaves" : 30,
          "min_child_weight" : 50,
          "learning_rate" : 0.1,
          "bagging_fraction" : 0.7,
          "feature_fraction" : 0.7,
          "reg_alpha": 0.15,
          "reg_lambda": 0.15,
          "bagging_seed" : 420,
          "verbosity" : -1
         }
lg_train = lgb.Dataset(X_train, label=y_train)
lg_test = lgb.Dataset(X_test, label=y_test)
model = lgb.train(params, lg_train, 1000, valid_sets=[lg_test], early_stopping_rounds=50, verbose_eval=100)
Training until validation scores don't improve for 50 rounds.
[100]	valid_0's multi_error: 0.0436299
[200]	valid_0's multi_error: 0.0403335
Early stopping, best iteration is:
[210]	valid_0's multi_error: 0.0402226
In [142]:
y_pred = model.predict(X_test)
In [143]:
y_pred[0]
Out[143]:
array([9.95e-01, 4.57e-03, 2.93e-05, 9.91e-06])
In [144]:
# convert predicted probabilities to class labels by taking the argmax
y_pred_ar = np.argmax(y_pred, axis=1)
y_pred_ar
Out[144]:
array([0, 0, 0, ..., 0, 0, 0])
In [145]:
print(classification_report(y_test, y_pred_ar))
              precision    recall  f1-score   support

           0       1.00      0.96      0.98     99152
           1       0.00      0.30      0.00        20
           2       0.00      0.14      0.00        14
           3       0.03      0.08      0.05        12

    accuracy                           0.96     99198
   macro avg       0.26      0.37      0.26     99198
weighted avg       1.00      0.96      0.98     99198

Adding Features

1. Continent

In [86]:
ft = df.copy()
ft["date"] = pd.to_datetime(ft["date"], format="%Y%m%d")
# select data from Jan to Sept
ft = ft[(ft.date < date(2017,10,1)) & (ft.date >= date(2017,1,1))].reset_index(drop=True)
ft.shape
Out[86]:
(636781, 38)
In [147]:
ft1 = ft[['fullVisitorId','geoNetwork.continent']]
ft1 = ft1.join(pd.get_dummies(ft1['geoNetwork.continent'])).drop(['(not set)'],axis = 1).drop(['geoNetwork.continent'], axis=1)

ft1 = ft1.groupby(['fullVisitorId']).agg(['max'])
ft1.columns = ['_'.join(col).strip() for col in ft1.columns.values]
ft1 = ft1.reset_index()

ft1.head()
Out[147]:
fullVisitorId Africa_max Americas_max Asia_max Europe_max Oceania_max
0 0000027376579751715 0 1 0 0 0
1 0000039460501403861 0 1 0 0 0
2 0000040862739425590 0 1 0 0 0
3 0000049363351866189 0 0 1 0 0
4 0000062267706107999 0 1 0 0 0
In [148]:
# merge continent features with cluster
ft1_df = pd.merge(cluster, ft1, on='fullVisitorId')
ft1_df.head()
Out[148]:
fullVisitorId Recency RecencyCluster Frequency FrequencyCluster logRevenue logRevenueCluster OverallScore logRevenue_12 LTVCluster Segment_High Segment_Mid1 Segment_Mid2 Segment_Low Africa_max Americas_max Asia_max Europe_max Oceania_max
0 8934116514970143966 4 3 3 1 16.612182 1 5 16.950570 1 0 0 1 0 0 1 0 0 0
1 3700714855829972615 5 3 8 1 19.586889 1 5 17.527862 1 0 0 1 0 0 1 0 0 0
2 0267692164162304307 19 3 3 1 16.165933 1 5 17.599018 1 0 0 1 0 0 1 0 0 0
3 4921798700570421930 59 3 5 1 17.087358 1 5 17.001863 1 0 0 1 0 0 1 0 0 0
4 1956114514458489728 0 3 6 1 17.656467 1 5 16.498585 1 0 0 1 0 0 1 0 0 0
In [149]:
X = ft1_df.drop(['LTVCluster','logRevenue_12','fullVisitorId'],axis=1)
y = ft1_df['LTVCluster']
In [150]:
X_columns = ft1_df.drop(['LTVCluster','logRevenue_12','fullVisitorId'],axis=1).columns
y_column = ['LTVCluster']
In [151]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=56)
In [152]:
# use SMOTE for the imbalanced data
X_train, y_train = SMOTE().fit_resample(X_train, y_train)
In [153]:
# convert X, y back to DataFrames so that we keep the column names
X_train = pd.DataFrame(X_train, columns=X_columns)
y_train = pd.DataFrame(y_train, columns=y_column)
In [154]:
# take a look at y after SMOTE: the 4 clusters now have the same number of samples
y_train.hist()
Out[154]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x00000264A2DF71D0>]],
      dtype=object)
In [155]:
# fitting LightGBM
params = {"objective" : "multiclass",
          "num_class": 4,
          "metric" : "multi_error",
          "num_leaves" : 30,
          "min_child_weight" : 50,
          "learning_rate" : 0.1,
          "bagging_fraction" : 0.7,
          "feature_fraction" : 0.7,
          "reg_alpha": 0.15,
          "reg_lambda": 0.15,
          "bagging_seed" : 420,
          "verbosity" : -1
         }
lg_train = lgb.Dataset(X_train, label=y_train)
lg_test = lgb.Dataset(X_test, label=y_test)
model = lgb.train(params, lg_train, 1000, valid_sets=[lg_test], early_stopping_rounds=50, verbose_eval=100)
Training until validation scores don't improve for 50 rounds.
[100]	valid_0's multi_error: 0.0407065
[200]	valid_0's multi_error: 0.0412609
Early stopping, best iteration is:
[177]	valid_0's multi_error: 0.0348596
In [156]:
y_pred = model.predict(X_test)
In [157]:
y_pred[0]
Out[157]:
array([9.89e-01, 1.14e-02, 4.81e-05, 1.99e-05])
In [158]:
# convert predicted probabilities to class labels by taking the argmax
y_pred_ar = np.argmax(y_pred, axis=1)
y_pred_ar
Out[158]:
array([0, 0, 0, ..., 0, 0, 0])
In [159]:
print(classification_report(y_test, y_pred_ar))
              precision    recall  f1-score   support

           0       1.00      0.97      0.98     99152
           1       0.00      0.25      0.01        20
           2       0.00      0.21      0.00        14
           3       0.03      0.08      0.04        12

    accuracy                           0.97     99198
   macro avg       0.26      0.38      0.26     99198
weighted avg       1.00      0.97      0.98     99198

In [160]:
fig, ax = plt.subplots(figsize=(8,5))
lgb.plot_importance(model, height=0.8, ax=ax)
ax.grid(False)
plt.ylabel('Feature', size=12)
plt.xlabel('Importance', size=12)
plt.title("Importance of the Features of our LightGBM Model", fontsize=12)
plt.show()

2. Continent and Country-US

In [101]:
ft2 = ft[['fullVisitorId','geoNetwork.country']].copy()
ft2['country_US']=0
ft2.loc[ft2['geoNetwork.country']=='United States','country_US']=1
ft2 = ft2.drop(['geoNetwork.country'],axis=1)

ft2.head()
Out[101]:
fullVisitorId country_US
0 1228032379240126503 0
1 1088084052558051564 0
2 3615794190344553521 0
3 8127234129777839476 0
4 7139860291376515793 0
In [161]:
# check that each customer transacts either only inside or only outside the US, never both
us = ft2.groupby(['fullVisitorId']).agg(['sum'])
us.columns = ['_'.join(col).strip() for col in us.columns.values]
us = us.reset_index()

us_merge = pd.merge(user_frequency, us, on='fullVisitorId')

us_merge['us_only'] = (us_merge['Frequency']==us_merge['country_US_sum'])
us_merge.loc[us_merge['Frequency']== 1,'us_only']=True
us_merge.loc[us_merge['country_US_sum']== 0,'us_only']=True
us_merge['us_only'] = us_merge['us_only']*1
In [162]:
# number of customers with transactions both inside and outside the US
print(us_merge[us_merge['us_only']==0]['fullVisitorId'].count())

# percentage of customers with transactions both inside and outside the US
print(us_merge[us_merge['us_only']==0]['fullVisitorId'].count()/us_merge.fullVisitorId.count())
40051
0.0807497746925839

Since the vast majority of customers (about 92%) transact either only inside or only outside the US, we can treat being in the US as a stable customer attribute.

In [163]:
ft2 = ft2.groupby(['fullVisitorId']).agg(['max'])
ft2.columns = ['_'.join(col).strip() for col in ft2.columns.values]
ft2 = ft2.reset_index()
ft2.columns = ['fullVisitorId', 'country_US']

ft2.head()
Out[163]:
fullVisitorId country_US
0 0000027376579751715 1
1 0000039460501403861 0
2 0000040862739425590 1
3 0000049363351866189 0
4 0000062267706107999 0
In [164]:
# merge the country_US feature with the previous feature set
ft2_df = pd.merge(ft1_df, ft2, on='fullVisitorId')
ft2_df.head()
Out[164]:
fullVisitorId Recency RecencyCluster Frequency FrequencyCluster logRevenue logRevenueCluster OverallScore logRevenue_12 LTVCluster Segment_High Segment_Mid1 Segment_Mid2 Segment_Low Africa_max Americas_max Asia_max Europe_max Oceania_max country_US
0 8934116514970143966 4 3 3 1 16.612182 1 5 16.950570 1 0 0 1 0 0 1 0 0 0 1
1 3700714855829972615 5 3 8 1 19.586889 1 5 17.527862 1 0 0 1 0 0 1 0 0 0 1
2 0267692164162304307 19 3 3 1 16.165933 1 5 17.599018 1 0 0 1 0 0 1 0 0 0 1
3 4921798700570421930 59 3 5 1 17.087358 1 5 17.001863 1 0 0 1 0 0 1 0 0 0 1
4 1956114514458489728 0 3 6 1 17.656467 1 5 16.498585 1 0 0 1 0 0 1 0 0 0 1
In [165]:
X = ft2_df.drop(['LTVCluster','logRevenue_12','fullVisitorId'],axis=1)
y = ft2_df['LTVCluster']
In [166]:
X_columns = ft2_df.drop(['LTVCluster','logRevenue_12','fullVisitorId'],axis=1).columns
y_column = ['LTVCluster']
In [167]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=56)
In [168]:
# using SMOTE for unbalanced data
X_train, y_train = SMOTE().fit_resample(X_train, y_train)
In [169]:
# change X, y to dataframe so that we can keep column names
X_train = pd.DataFrame(X_train, columns=X_columns)
y_train = pd.DataFrame(y_train, columns=y_column)
In [170]:
# check y after SMOTE: all 4 clusters now have the same number of samples
y_train.hist()
Out[170]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x00000264A2E20DA0>]],
      dtype=object)
In [171]:
# fitting lightGBM
params = {"objective" : "multiclass",
          "num_class": 4,
          "metric" : "multi_error",
          "num_leaves" : 30,
          "min_child_weight" : 50,
          "learning_rate" : 0.1,
          "bagging_fraction" : 0.7,
          "feature_fraction" : 0.7,
          "reg_alpha": 0.15,
          "reg_lambda": 0.15,
          "min_child_weight": 50,
          "bagging_seed" : 420,
          "verbosity" : -1
         }
lg_train = lgb.Dataset(X_train, label=y_train)
lg_test = lgb.Dataset(X_test, label=y_test)
model = lgb.train(params, lg_train, 1000, valid_sets=[lg_test], early_stopping_rounds=50, verbose_eval=100)
Training until validation scores don't improve for 50 rounds.
[100]	valid_0's multi_error: 0.0438214
[200]	valid_0's multi_error: 0.0423396
Early stopping, best iteration is:
[172]	valid_0's multi_error: 0.0381762
In [172]:
y_pred = model.predict(X_test)
In [173]:
y_pred[0]
Out[173]:
array([9.86e-01, 1.41e-02, 8.09e-05, 2.32e-05])
In [174]:
# convert predicted probabilities to class labels (index of the max probability)
y_pred_ar = np.argmax(y_pred, axis=1)
y_pred_ar
Out[174]:
array([0, 0, 0, ..., 0, 0, 0])
In [175]:
print(classification_report(y_test, y_pred_ar))
              precision    recall  f1-score   support

           0       1.00      0.96      0.98     99152
           1       0.00      0.20      0.00        20
           2       0.00      0.21      0.00        14
           3       0.03      0.08      0.05        12

    accuracy                           0.96     99198
   macro avg       0.26      0.36      0.26     99198
weighted avg       1.00      0.96      0.98     99198

In [176]:
fig, ax = plt.subplots(figsize=(8,5))
lgb.plot_importance(model, height=0.8, ax=ax)
ax.grid(False)
plt.ylabel('Feature', size=12)
plt.xlabel('Importance', size=12)
plt.title("Importance of the Features of our LightGBM Model", fontsize=12)
plt.show()

3. Country-US Only

In [177]:
# merge the country_US feature with the cluster/segment features
ft_us = pd.merge(cluster, ft2, on='fullVisitorId')
ft_us.head()
Out[177]:
fullVisitorId Recency RecencyCluster Frequency FrequencyCluster logRevenue logRevenueCluster OverallScore logRevenue_12 LTVCluster Segment_High Segment_Mid1 Segment_Mid2 Segment_Low country_US
0 8934116514970143966 4 3 3 1 16.612182 1 5 16.950570 1 0 0 1 0 1
1 3700714855829972615 5 3 8 1 19.586889 1 5 17.527862 1 0 0 1 0 1
2 0267692164162304307 19 3 3 1 16.165933 1 5 17.599018 1 0 0 1 0 1
3 4921798700570421930 59 3 5 1 17.087358 1 5 17.001863 1 0 0 1 0 1
4 1956114514458489728 0 3 6 1 17.656467 1 5 16.498585 1 0 0 1 0 1
In [178]:
X = ft_us.drop(['LTVCluster','logRevenue_12','fullVisitorId'],axis=1)
y = ft_us['LTVCluster']
In [179]:
X_columns = ft_us.drop(['LTVCluster','logRevenue_12','fullVisitorId'],axis=1).columns
y_column = ['LTVCluster']
In [180]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=56)
In [181]:
# using SMOTE for unbalanced data
X_train, y_train= SMOTE().fit_resample(X_train, y_train)
In [182]:
# change X, y to dataframe so that we can keep column names
X_train = pd.DataFrame(X_train, columns=X_columns)
y_train = pd.DataFrame(y_train, columns=y_column)
In [183]:
# check y after SMOTE: all 4 clusters now have the same number of samples
y_train.hist()
Out[183]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x00000264A73329E8>]],
      dtype=object)
In [184]:
# fitting lightGBM
params = {"objective" : "multiclass",
          "num_class": 4,
          "metric" : "multi_error",
          "num_leaves" : 30,
          "min_child_weight" : 50,
          "learning_rate" : 0.1,
          "bagging_fraction" : 0.7,
          "feature_fraction" : 0.7,
          "reg_alpha": 0.15,
          "reg_lambda": 0.15,
          "min_child_weight": 50,
          "bagging_seed" : 420,
          "verbosity" : -1
         }
lg_train = lgb.Dataset(X_train, label=y_train)
lg_test = lgb.Dataset(X_test, label=y_test)
model = lgb.train(params, lg_train, 1000, valid_sets=[lg_test], early_stopping_rounds=50, verbose_eval=100)
Training until validation scores don't improve for 50 rounds.
[100]	valid_0's multi_error: 0.0409686
Early stopping, best iteration is:
[109]	valid_0's multi_error: 0.0406258
In [185]:
y_pred = model.predict(X_test)
In [186]:
y_pred[0]
Out[186]:
array([9.80e-01, 1.95e-02, 4.09e-04, 1.74e-04])
In [187]:
# convert predicted probabilities to class labels (index of the max probability)
y_pred_ar = np.argmax(y_pred, axis=1)
y_pred_ar
Out[187]:
array([0, 0, 0, ..., 0, 0, 0])
In [188]:
print(classification_report(y_test, y_pred_ar))
              precision    recall  f1-score   support

           0       1.00      0.96      0.98     99152
           1       0.00      0.25      0.00        20
           2       0.00      0.21      0.00        14
           3       0.02      0.08      0.03        12

    accuracy                           0.96     99198
   macro avg       0.25      0.38      0.25     99198
weighted avg       1.00      0.96      0.98     99198

In [189]:
fig, ax = plt.subplots(figsize=(8,5))
lgb.plot_importance(model, height=0.8, ax=ax)
ax.grid(False)
plt.ylabel('Feature', size=12)
plt.xlabel('Importance', size=12)
plt.title("Importance of the Features of our LightGBM Model", fontsize=12)
plt.show()

Adding both the continent features and country_US gives almost the same result as adding country_US alone. To reduce the risk of overfitting, we select the third model, which adds country_US only.

In [190]:
class_names = [0, 1, 2, 3]  # names of the classes

def plot_confusion_matrix(y_true, y_pred, classes,
                          normalize=False,
                          title=None,
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if not title:
        if normalize:
            title = 'Normalized confusion matrix'
        else:
            title = 'Confusion matrix, without normalization'

    # Compute confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    if normalize:
        # normalize by row (true-label support) so each row sums to 1
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    fig, ax = plt.subplots()
    im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
    ax.figure.colorbar(im, ax=ax)
    # We want to show all ticks...
    ax.set(xticks=np.arange(cm.shape[1]),
           yticks=np.arange(cm.shape[0]),
           # ... and label them with the respective list entries
           xticklabels=classes, yticklabels=classes,
           title=title,
           ylabel='True label',
           xlabel='Predicted label')

    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
             rotation_mode="anchor")

    # Loop over data dimensions and create text annotations.
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, format(cm[i, j], fmt),
                    ha="center", va="center",
                    color="white" if cm[i, j] > thresh else "black")
    fig.tight_layout()
    return ax
In [191]:
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plot_confusion_matrix(y_test, y_pred_ar, classes=class_names,
                      title='Confusion matrix, without normalization')

# Plot normalized confusion matrix
plot_confusion_matrix(y_test, y_pred_ar, classes=class_names, normalize=True,
                      title='Normalized confusion matrix')

plt.show()
Confusion matrix, without normalization
[[95159  2261  1671    61]
 [   12     5     2     1]
 [    7     3     3     1]
 [    5     3     3     1]]
Normalized confusion matrix
[[9.60e-01 2.28e-02 1.69e-02 6.15e-04]
 [6.00e-01 2.50e-01 1.00e-01 5.00e-02]
 [5.00e-01 2.14e-01 2.14e-01 7.14e-02]
 [4.17e-01 2.50e-01 2.50e-01 8.33e-02]]
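Row-normalization of a confusion matrix (dividing each row by its true-label support so that every row sums to 1) can be sanity-checked on a toy matrix; a minimal numpy sketch, not tied to this notebook's data:

```python
import numpy as np

# Toy 2x2 confusion matrix: rows are true labels, columns are predictions.
cm = np.array([[8., 2.],
               [1., 4.]])

# Divide each row by its row sum (the number of true examples of that class).
cm_norm = cm / cm.sum(axis=1)[:, np.newaxis]
print(cm_norm)               # [[0.8 0.2], [0.2 0.8]]
print(cm_norm.sum(axis=1))   # each row sums to 1.0
```

The diagonal of the row-normalized matrix is exactly the per-class recall, which is why this convention pairs naturally with the classification report above.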
In [ ]: